
Google Play Store Data Analysis Report

0. Introduction

This data set was scraped from the Google Play Store. It contains 13 features that provide detailed information about each app, including the name, category, rating, number of reviews, and so on.

In this report, I analyze the data in the following 6 aspects:
1. Import Packages: set up the environment for the whole project
2. Data Cleaning: handle missing values and fix data types
3. Exploratory Data Analysis: analyze and visualize the data
4. Linear Model Analysis: use linear models to fit the data
5. Tree-based Model Analysis: use tree-based models to fit the data
6. Summary: draw conclusions

With this data set, I want to predict app Reviews (continuous; linear model) and Installs (discrete; tree-based model) from the known information.

1. Import Packages

# clear the workspace
rm(list=ls())
# import packages 
library(MASS)
library(readr)
library(ggplot2)
library(corrplot)
library(Amelia)
library(reshape2)
library(caret)
library(caTools)
library(dplyr)
library(tidyr)
library(plotly)
library(texreg)
library(leaps)
library(rpart)
library(rpart.plot)
library(e1071)
google_data <- read_csv('googleplaystore.csv')
head(google_data)
## # A tibble: 6 x 13
##   App   Category Rating Reviews Size  Installs Type  Price `Content Rating`
##   <chr> <chr>     <dbl> <chr>   <chr> <chr>    <chr> <chr> <chr>           
## 1 Phot~ ART_AND~    4.1 159     19    10000    Free  0     Everyone        
## 2 Colo~ ART_AND~    3.9 967     14    500000   Free  0     Everyone        
## 3 "U L~ ART_AND~    4.7 87510   8.7   5000000  Free  0     Everyone        
## 4 Sket~ ART_AND~    4.5 215644  25    50000000 Free  0     Teen            
## 5 Pixe~ ART_AND~    4.3 967     2.8   100000   Free  0     Everyone        
## 6 Pape~ ART_AND~    4.4 167     5.6   50000    Free  0     Everyone        
## # ... with 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current
## #   Ver` <chr>, `Android Ver` <chr>

2. Data Cleaning

# Draw the data missmap
missmap(google_data, legend=FALSE)

Most of the missing values sit in the 'Rating' column, so we use the drop_na() function to drop the rows that contain NA values.
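Before dropping rows, it is worth checking which columns actually carry the missing values; a minimal sketch on a toy frame (the report itself would run the same calls directly on `google_data`):

```r
# toy stand-in for google_data; only Rating has NAs here
df <- data.frame(Rating  = c(4.1, NA, 4.7, NA),
                 Reviews = c(159, 967, 87510, 100))

# NAs per column -- confirms which columns drive the row loss
na_counts <- colSums(is.na(df))
na_counts                  # Rating: 2, Reviews: 0

# rows a complete-case drop (what drop_na() does) would keep
sum(complete.cases(df))    # 2
```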

# drop the null data from the original data set
google_data <- drop_na(google_data)
# see first few lines of data
head(google_data)
## # A tibble: 6 x 13
##   App   Category Rating Reviews Size  Installs Type  Price `Content Rating`
##   <chr> <chr>     <dbl> <chr>   <chr> <chr>    <chr> <chr> <chr>           
## 1 Phot~ ART_AND~    4.1 159     19    10000    Free  0     Everyone        
## 2 Colo~ ART_AND~    3.9 967     14    500000   Free  0     Everyone        
## 3 "U L~ ART_AND~    4.7 87510   8.7   5000000  Free  0     Everyone        
## 4 Sket~ ART_AND~    4.5 215644  25    50000000 Free  0     Teen            
## 5 Pixe~ ART_AND~    4.3 967     2.8   100000   Free  0     Everyone        
## 6 Pape~ ART_AND~    4.4 167     5.6   50000    Free  0     Everyone        
## # ... with 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current
## #   Ver` <chr>, `Android Ver` <chr>
# see the data type of the data set
str(google_data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    7683 obs. of  13 variables:
##  $ App           : chr  "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite <e2><U+0080>?FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
##  $ Category      : chr  "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
##  $ Rating        : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
##  $ Reviews       : chr  "159" "967" "87510" "215644" ...
##  $ Size          : chr  "19" "14" "8.7" "25" ...
##  $ Installs      : chr  "10000" "500000" "5000000" "50000000" ...
##  $ Type          : chr  "Free" "Free" "Free" "Free" ...
##  $ Price         : chr  "0" "0" "0" "0" ...
##  $ Content Rating: chr  "Everyone" "Everyone" "Everyone" "Teen" ...
##  $ Genres        : chr  "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
##  $ Last Updated  : chr  "7-Jan-18" "15-Jan-18" "1-Aug-18" "8-Jun-18" ...
##  $ Current Ver   : chr  "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
##  $ Android Ver   : chr  "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...

Because the data set was imported directly from a .csv file, several columns have the wrong data type. We need to correct them manually.
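In the original Kaggle dump, the `Installs` column is stored as strings such as "10,000+"; if the file at hand still has that formatting, `as.numeric()` alone would produce NAs, so the punctuation must be stripped first. A hedged sketch (the raw values below are hypothetical):

```r
# hypothetical raw values as they appear in the original dump
raw_installs <- c("10,000+", "500,000+", "5,000,000+")

# drop commas and the trailing '+' before converting to numeric
clean_installs <- as.numeric(gsub("[+,]", "", raw_installs))
clean_installs   # 10000 500000 5000000
```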

# change data type of the original data set
google_data$App <- as.character(google_data$App)
google_data$Reviews <- as.numeric(as.character(google_data$Reviews))
google_data$Price <- as.numeric(as.character(google_data$Price))
google_data$Size <- as.numeric(google_data$Size)
google_data$`Current Ver` <- as.character(google_data$`Current Ver`)
google_data$`Android Ver` <- as.character(google_data$`Android Ver`)
google_data$Category <- as.factor(google_data$Category)
google_data$Genres <- as.factor(google_data$Genres)
google_data$`Content Rating` <- as.factor(google_data$`Content Rating`)
google_data$Installs <- as.numeric(google_data$Installs)
google_data <- google_data[-which(names(google_data)=='Type')]
# get a summary information of the data set
summary(google_data)
##      App                       Category        Rating     
##  Length:7683        FAMILY         :1602   Min.   :1.000  
##  Class :character   GAME           : 966   1st Qu.:4.000  
##  Mode  :character   TOOLS          : 627   Median :4.300  
##                     MEDICAL        : 324   Mean   :4.174  
##                     LIFESTYLE      : 279   3rd Qu.:4.500  
##                     PERSONALIZATION: 278   Max.   :5.000  
##                     (Other)        :3607                  
##     Reviews              Size            Installs        
##  Min.   :       1   Min.   :  0.023   Min.   :1.000e+00  
##  1st Qu.:     107   1st Qu.:  6.000   1st Qu.:1.000e+04  
##  Median :    2310   Median : 16.000   Median :1.000e+05  
##  Mean   :  296209   Mean   : 36.987   Mean   :8.459e+06  
##  3rd Qu.:   38825   3rd Qu.: 37.000   3rd Qu.:1.000e+06  
##  Max.   :44893888   Max.   :994.000   Max.   :1.000e+09  
##                                                          
##      Price                 Content Rating           Genres    
##  Min.   :  0.000   Adults only 18+:   2   Tools        : 627  
##  1st Qu.:  0.000   Everyone       :6138   Entertainment: 446  
##  Median :  0.000   Everyone 10+   : 316   Education    : 417  
##  Mean   :  1.132   Mature 17+     : 363   Medical      : 324  
##  3rd Qu.:  0.000   Teen           : 863   Action       : 321  
##  Max.   :400.000   Unrated        :   1   Lifestyle    : 278  
##                                           (Other)      :5270  
##  Last Updated       Current Ver        Android Ver       
##  Length:7683        Length:7683        Length:7683       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
## 
# since some feature names contain spaces, we rename them
data <-rename(google_data,content_rating=`Content Rating`, Android_ver= `Android Ver`,current_ver = `Current Ver`)
head(data)
## # A tibble: 6 x 12
##   App   Category Rating Reviews  Size Installs Price content_rating Genres
##   <chr> <fct>     <dbl>   <dbl> <dbl>    <dbl> <dbl> <fct>          <fct> 
## 1 Phot~ ART_AND~    4.1     159  19      10000     0 Everyone       Art &~
## 2 Colo~ ART_AND~    3.9     967  14     500000     0 Everyone       Art &~
## 3 "U L~ ART_AND~    4.7   87510   8.7  5000000     0 Everyone       Art &~
## 4 Sket~ ART_AND~    4.5  215644  25   50000000     0 Teen           Art &~
## 5 Pixe~ ART_AND~    4.3     967   2.8   100000     0 Everyone       Art &~
## 6 Pape~ ART_AND~    4.4     167   5.6    50000     0 Everyone       Art &~
## # ... with 3 more variables: `Last Updated` <chr>, current_ver <chr>,
## #   Android_ver <chr>

3. Exploratory Data Analysis

We intend to analyze the data in the following aspects:

The categories with the highest market share

attach(google_data)
# compute the market share of each category
share_count <- c()
index <- 1
# the loop is to calculate the market share of every category
for( i in levels(Category)){
  temp_df <- google_data[which(Category == i),]
  share_count[index] <- dim(temp_df)[1]
  index <- index + 1
}
df <- data.frame(levels(Category),share_count)
df <- df[order(df$share_count,decreasing = TRUE),]
p <- plot_ly(df,labels=~levels.Category.,values=~share_count,type='pie')
p

Finding
1. Nearly half of the market share is held by apps from the 'Family' (20.9%), 'Game' (12.6%), 'Tools' (8.16%), 'Medical' (4.22%), and 'Lifestyle' (3.63%) categories.
2. The distribution of market share is uneven; the gap between the largest and smallest categories is very large.
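The counting loop above can be collapsed into a single dplyr pipeline; a sketch on a toy frame (the report would pipe `google_data` instead):

```r
library(dplyr)

# toy category column standing in for google_data$Category
df <- data.frame(Category = c("FAMILY", "GAME", "FAMILY", "TOOLS", "FAMILY"))

share <- df %>%
  count(Category, name = "share_count") %>%                 # rows per category
  arrange(desc(share_count)) %>%                            # largest share first
  mutate(share_pct = 100 * share_count / sum(share_count))  # percentage share
share
```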

The distributions of app features

# draw the ggplot image 
# image_2 is the distribution of App rating 
image_2 <- ggplotly(ggplot(google_data, aes(x=Rating)) + geom_area(stat="bin",fill='#1E90FF') +geom_vline(xintercept = mean(Rating),col='red',lty=3,lwd = 1 )+xlab("Rating Score")+ylab("Number of App") +theme_bw() +theme(plot.title = element_text(hjust = 0.5),axis.text.y = element_text(angle=90,hjust=1)))
# image_3 is the distribution of App reviews
image_3 <- ggplotly(ggplot(google_data, aes(x=Reviews)) + geom_area(stat="bin",fill='#98FB98') +geom_vline(xintercept = mean(Reviews),col='red',lty=3,lwd = 1 )+xlab("Reviews Amount")+ylab("Number of App") + ggtitle("Distribution of Different Features") +theme_bw() +theme(plot.title = element_text(hjust = 0.5),axis.text.y = element_text(angle=90,hjust=1)))
# image_4 is the distribution of App size 
image_4 <- ggplotly(ggplot(google_data, aes(x=Size)) + geom_area(stat="bin",fill='#DAA520') +geom_vline(xintercept = mean(Size),col='red',lty=3,lwd = 1 )+xlab("Size")+ylab("Number of App") +theme_bw() +theme(plot.title = element_text(hjust = 0.5),axis.text.x = element_text(angle = 90, hjust = 1),axis.text.y = element_text(angle=90,hjust=1)))
# image_5 is the distribution of App installs
image_5 <- ggplotly(ggplot(google_data, aes(x=Installs)) + geom_area(stat="bin",fill='#D2691E') +geom_vline(xintercept = mean(Installs),col='red',lty=3,lwd = 1 )+xlab("Installs")+ylab("Number of App") +theme_bw() +theme(plot.title = element_text(hjust = 0.5),axis.text.x = element_text(angle = 90, hjust = 1),axis.text.y = element_text(angle=90,hjust=1)))
subplot(image_2,image_3,image_4,image_5,nrows=2, margin = 0.05)

Finding
1. Most apps do well in the store, with an average rating of 4.17; both the very-low and very-high score regions are small
2. Most apps receive a number of reviews in the interval [7,000, 300,000]
3. App size is usually smaller than 100 MB
4. The mean install count is 8,458,767

The rating in different categories

# compute the average Rating score by group
cat_data <- group_by(google_data,Category)
data <- summarise(cat_data,count = n(),rating_score = mean(Rating))
# sort the data by descending order
data <- data[order(data$rating_score,decreasing = TRUE),]
# image_6 shows the average rating grouped by category
image_6 <- ggplot(data = data,aes(x = reorder(Category, -rating_score),y = rating_score,fill=Category)) + geom_bar(stat = 'identity', position = 'dodge')+theme(axis.text.x = element_text(angle = 90, hjust = 1))+geom_hline(aes(yintercept=4), colour="white", linetype="dashed")+theme(panel.border = element_blank())+ coord_cartesian(ylim=c(3.5, 4.5)) +xlab("Categories in Google Play Store")+ylab("Rating")+ ggtitle("Rating in Different Categories") +theme(plot.title = element_text(hjust = 0.5))+scale_fill_discrete(name="Category")
ggplotly(image_6)

Finding
1. There are some differences in rating across categories, but they are quite small
2. The 'Art and Design' category has the highest rating score, with 'Auto and Vehicles' close behind

The reviews in different categories

# calculate the reviews grouped by 'categories'
cat_data <- group_by(google_data,Category)
cat_data <- summarise(cat_data,count = n(),reviews = sum(Reviews))
# 'p' respresents 'Reviews' grouped by 'categories'
p <- google_data %>%
  plot_ly(
    x = ~Category,
    y = ~Reviews,
    split = ~Category,
    type = 'violin',
    box = list(
      visible = T
    ),
    meanline = list(
      visible = T
    )
  ) %>% 
  layout(
    title = "Reviews in Different Categories",
    xaxis = list(
      title = "Categories in Google Play Store"
    ),
    yaxis = list(
      title = "Reviews",
      zeroline = F
    )
  )
p

Finding
1. Almost all app categories perform decently; 'Health and Fitness' and 'Books and Reference' produce the highest-quality apps, with 50% of their apps rated above 4.5
2. In contrast, 50% of the apps in the 'Dating' category are rated below the average

# Fitted line by 'Rating' and 'Reviews'
line_1 <- lm(Rating~Reviews,data = google_data)
new <- data.frame(google_data$Reviews)
y <- predict(line_1,newdata = new)
p <- plot_ly(data = google_data, x = ~google_data$Reviews, y = ~google_data$Rating,type = 'scatter',name='Actual Points') %>%add_trace(y = ~y, name = 'Linear Regression', mode = 'lines')%>% 
  layout(
    title = "Rating vs Reviews",
    xaxis = list(
      title = "Reviews"
    ),
    yaxis = list(
      title = "Rating",
      zeroline = F
    )
  )
p
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

Finding
1. Rating and Reviews do not seem to have a strong relationship
2. The fitted line shows that an increase in Reviews leads to only a limited increase in Rating
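Because Reviews spans seven orders of magnitude, a raw-scale line is dominated by a handful of huge apps; fitting against `log10(Reviews)` usually gives a more honest picture. A self-contained sketch on synthetic data shaped like this skew (the report would substitute the real `Rating` and `Reviews` columns):

```r
set.seed(1)
# synthetic data: ratings rise slowly with the log of review counts
reviews <- 10^runif(200, 0, 7)
rating  <- pmin(5, 3.5 + 0.08 * log10(reviews) + rnorm(200, sd = 0.3))

fit_raw <- lm(rating ~ reviews)          # raw scale
fit_log <- lm(rating ~ log10(reviews))   # log scale

# compare how much variance each scale explains
c(raw = summary(fit_raw)$r.squared, log = summary(fit_log)$r.squared)
```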

# change the data type and pick the numeric ones from all the columns
google_data$Category <- as.numeric(google_data$Category)
google_data$Genres <- as.numeric(google_data$Genres)
google_data$`Content Rating` <- as.numeric(google_data$`Content Rating`)
google_num <- select_if(google_data,is.numeric)

T Test

Claim: the average value of 'Rating' equals 4

# Test if the average value of 'rating' equals to 4 
t.test(Rating,mu = 4,alt = "two.sided", conf=0.95,data=google_data)
## 
##  One Sample t-test
## 
## data:  Rating
## t = 27.907, df = 7682, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 4
## 95 percent confidence interval:
##  4.161373 4.185757
## sample estimates:
## mean of x 
##  4.173565

Finding
1. The p-value of the t-test is far below 0.05, so at the 95% confidence level we reject the claim that the mean rating equals 4; the sample mean is about 4.17, slightly above 4

4. Linear Model Analysis

Correlation Map

#draw the correlation map of different features in numeric columns of google_data
cor_google <- cor(google_num)
melted_cor <- melt(cor_google)
cor_image <- ggplot(data = melted_cor, aes(x=Var1, y=Var2, fill=value)) + geom_tile() +xlab("")+ylab("")+ ggtitle("The Correlation of Different Features") +theme(plot.title = element_text(hjust = 0.5)) 
ggplotly(cor_image)

Here I converted some columns to numeric again and visualized their correlations in the map above. The brighter a square is, the stronger the correlation between the two features.
Finding
1. The 'Reviews'/'Installs' and 'Genres'/'Category' pairs show significant positive correlations
2. Most of the other feature pairs are not strongly correlated

Polynomial Model

#set a seed 
set.seed(123)
#Split the data into a train set and a test set
split = sample.split(google_num$Reviews,SplitRatio =0.75)
train = subset(google_num,split==TRUE)
test = subset(google_num,split==FALSE)
set.seed(123)
train.control = trainControl(method = "repeatedcv", 
                              number = 10, repeats = 3)

regsubsets.out <- regsubsets( Reviews ~ .,
                              data = train,
                              nbest = 1,
                              nvmax = NULL,
                              force.in = NULL, force.out = NULL,
                              method = 'forward')
summary(regsubsets.out)
## Subset selection object
## Call: regsubsets.formula(Reviews ~ ., data = train, nbest = 1, nvmax = NULL, 
##     force.in = NULL, force.out = NULL, method = "forward")
## 7 Variables  (and intercept)
##                  Forced in Forced out
## Category             FALSE      FALSE
## Rating               FALSE      FALSE
## Size                 FALSE      FALSE
## Installs             FALSE      FALSE
## Price                FALSE      FALSE
## `Content Rating`     FALSE      FALSE
## Genres               FALSE      FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: forward
##          Category Rating Size Installs Price `Content Rating` Genres
## 1  ( 1 ) " "      " "    " "  "*"      " "   " "              " "   
## 2  ( 1 ) " "      "*"    " "  "*"      " "   " "              " "   
## 3  ( 1 ) " "      "*"    "*"  "*"      " "   " "              " "   
## 4  ( 1 ) " "      "*"    "*"  "*"      " "   " "              "*"   
## 5  ( 1 ) "*"      "*"    "*"  "*"      " "   " "              "*"   
## 6  ( 1 ) "*"      "*"    "*"  "*"      " "   "*"              "*"   
## 7  ( 1 ) "*"      "*"    "*"  "*"      "*"   "*"              "*"
model_1 <- lm(Reviews ~.-Price,data=train)
# use train_validation method to train linear model
model_1_cv= train(Reviews ~.-Price , data = train, method = "lm",
               trControl = train.control)
model_1_cv
## Linear Regression 
## 
## 5763 samples
##    7 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 5187, 5186, 5187, 5187, 5186, 5187, ... 
## Resampling results:
## 
##   RMSE     Rsquared   MAE     
##   1391093  0.4606649  302535.2
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Here I split the original data set into a train set and a test set, and used forward selection to find the best subset model
Finding
1. The selection table tells us that the best model includes every variable except 'Price'
2. The performance of the model is mediocre, reaching an R-squared of about 0.46
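`regsubsets()` does more than print the selection table: its summary carries BIC (plus Cp and adjusted R-squared) per model size, so the stopping point can be read off programmatically. A self-contained sketch on synthetic data (the same calls would apply to `regsubsets.out`):

```r
library(leaps)
set.seed(123)
# synthetic frame: y depends on x1 and x2; x3 is pure noise
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 2 * df$x1 - 1.5 * df$x2 + rnorm(n)

fit <- regsubsets(y ~ ., data = df, method = "forward")
s   <- summary(fit)

best_size <- which.min(s$bic)   # model size with the lowest BIC
coef(fit, best_size)            # coefficients of that model
```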

Polynomial model with interactions

# since the plain linear model cannot fit the data very well, we try adding some interactions to the model
# Installs is strongly correlated with Reviews, so we start with interactions involving Installs
model_2 <- lm(Reviews ~ Size*Installs+`Content Rating`+Genres+Category,data=train)
# model_2_cv considerates the interaction between different features
model_2_cv= train(Reviews ~ Size*Installs*Rating, data = google_num, method = "lm",
               trControl = train.control)
model_2_cv
## Linear Regression 
## 
## 7683 samples
##    3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 6915, 6914, 6915, 6915, 6914, 6914, ... 
## Resampling results:
## 
##   RMSE     Rsquared   MAE     
##   1199619  0.6297199  250629.3
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Finding
1. The performance of this model is far better than the previous one, reaching about 0.63 in R-squared

Compare the two models

# In order to know the performance of the two models, we use anova function to get F-Test
anova(model_1,model_2)
## Analysis of Variance Table
## 
## Model 1: Reviews ~ (Category + Rating + Size + Installs + Price + `Content Rating` + 
##     Genres) - Price
## Model 2: Reviews ~ Size * Installs + `Content Rating` + Genres + Category
##   Res.Df        RSS Df  Sum of Sq F Pr(>F)
## 1   5756 1.2336e+16                       
## 2   5756 9.9482e+15  0 2.3882e+15

Finding
1. Because the two models have the same residual degrees of freedom, anova() cannot report an F-statistic or p-value here; the RSS nevertheless drops from about 1.23e16 to 9.95e15, which suggests the interaction terms substantially improve the fit
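When anova() cannot produce an F-test, information criteria such as AIC can compare the fits instead: lower is better, and non-nested models are not a problem. A self-contained sketch on synthetic data where the true relationship contains an interaction:

```r
set.seed(7)
n  <- 300
df <- data.frame(a = rnorm(n), b = rnorm(n))
df$y <- df$a * df$b + rnorm(n)       # true relationship is an interaction

m_add <- lm(y ~ a + b, data = df)    # additive model
m_int <- lm(y ~ a * b, data = df)    # model with interaction

# lower AIC -> better trade-off between fit and complexity
c(additive = AIC(m_add), interaction = AIC(m_int))
```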

Visualize the performance of two models

# visualize the performance of the model_1_cv
pred_1 <- predict(model_1_cv,test)
results_1 <- data.frame(actual = test$Reviews, pred_1 = pred_1)
image_3 <- ggplot(data = results_1,aes(x = actual, y = pred_1)) + geom_point()+geom_abline(slope= 1,intercept = 0,colour='orange')+xlab("Actual Values")+ylab("Predicted Values")+theme_bw()
ggplotly(image_3)
# visualize the performance of the model_2_cv
pred_2 <- predict(model_2_cv,test)
results_2 <- data.frame(actual = test$Reviews, pred_2 = pred_2)
image_4 <- ggplot(data = results_2,aes(x = actual, y = pred_2)) + geom_point()+geom_abline(slope= 1,intercept = 0,colour='orange')+xlab("Actual Values")+ylab("Predicted Values")+theme_bw()
ggplotly(image_4) 

The two images show the different performance of the linear models. Apparently the second one (with interactions) is far better than the first.

5. Tree-based Model Analysis

Decision Tree

#Since 'Installs' is discrete, we want to predict the install bracket an app will reach (e.g. 500+, 5000+)
train <- mutate(train,Installs= as.factor(Installs))
# set up a tree classification
classifier_tree = train(Installs ~Reviews+Rating+Category+Size+Genres, data = train, method = "rpart",parms = list(split = "information"),trControl=train.control,tuneLength = 10)
# visualize the tree model
plot(classifier_tree$finalModel)
text(classifier_tree$finalModel)

# use prp function to visualize the tree model
prp(classifier_tree$finalModel, box.palette = "Reds", tweak = 1.2)

# print the detailed information about tree model
print(classifier_tree)
## CART 
## 
## 5763 samples
##    5 predictor
##   19 classes: '1', '5', '10', '50', '100', '500', '1000', '5000', '10000', '50000', '1e+05', '5e+05', '1e+06', '5e+06', '1e+07', '5e+07', '1e+08', '5e+08', '1e+09' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 5185, 5188, 5188, 5186, 5187, 5187, ... 
## Resampling results across tuning parameters:
## 
##   cp           Accuracy   Kappa    
##   0.001800679  0.5220112  0.4594452
##   0.001973821  0.5218374  0.4591094
##   0.002701018  0.5193497  0.4560349
##   0.004570954  0.5142599  0.4492838
##   0.016621650  0.4804716  0.4091054
##   0.023270310  0.4593637  0.3811746
##   0.065447746  0.3982189  0.3086372
##   0.092457926  0.3617935  0.2634337
##   0.094951174  0.3201316  0.2101014
##   0.129648868  0.2828969  0.1618843
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.001800679.

Confusion Matrix

y_pred = predict(classifier_tree, newdata = test)
df<-data.frame(table(test$Installs, y_pred))
df <- mutate(df,Var1 = as.numeric(Var1),y_pred=as.numeric(y_pred),error=Freq)
# cor_image visualizes the confusion matrix of the decision tree model
cor_image <- ggplot(data = df, aes(x=Var1, y=y_pred, fill=error)) + geom_tile() +xlab("Actual Values")+ylab("Predicted Values")+ ggtitle("Confusion Matrix") +theme(plot.title = element_text(hjust = 0.5))
ggplotly(cor_image)
error <- mean(test$Installs != y_pred) # Misclassification error
paste('Accuracy',round(1-error,4))
## [1] "Accuracy 0.534"
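The heat map and accuracy figure can be complemented with `caret::confusionMatrix()`, which also reports Kappa and per-class sensitivity/specificity. A self-contained sketch with hypothetical install brackets (the report would pass `y_pred` and `factor(test$Installs)`):

```r
library(caret)

# hypothetical predicted vs actual install brackets
actual <- factor(c("1000", "1000", "10000", "10000", "1e+05", "1e+05"),
                 levels = c("1000", "10000", "1e+05"))
pred   <- factor(c("1000", "10000", "10000", "10000", "1e+05", "1000"),
                 levels = c("1000", "10000", "1e+05"))

# full confusion matrix with overall and per-class statistics
cm <- confusionMatrix(pred, actual)
cm$overall["Accuracy"]   # 4 of 6 brackets predicted correctly
```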

Decision Tree(Pruned with Length 5)

classifier_tree = train(Installs ~Reviews+Rating+Category+Size+Genres, data = train, method = "rpart",parms = list(split = "information"),trControl=train.control,tuneLength = 5)
# visualize the tree model
plot(classifier_tree$finalModel)
text(classifier_tree$finalModel)

# use prp function to visualize the tree model
prp(classifier_tree$finalModel, box.palette = "Reds", tweak = 1.2)

# print the detailed information about tree model
print(classifier_tree)
## CART 
## 
## 5763 samples
##    5 predictor
##   19 classes: '1', '5', '10', '50', '100', '500', '1000', '5000', '10000', '50000', '1e+05', '5e+05', '1e+06', '5e+06', '1e+07', '5e+07', '1e+08', '5e+08', '1e+09' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 5186, 5187, 5185, 5187, 5184, 5188, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.02327031  0.4582076  0.3800382
##   0.06544775  0.3949644  0.3047400
##   0.09245793  0.3615083  0.2631028
##   0.09495117  0.3274837  0.2196920
##   0.12964887  0.2827252  0.1617018
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.02327031.

Confusion Matrix

y_pred = predict(classifier_tree, newdata = test)
df<-data.frame(table(test$Installs, y_pred))
df <- mutate(df,Var1 = as.numeric(Var1),y_pred=as.numeric(y_pred),error=Freq)
# cor_image visualizes the confusion matrix of the pruned decision tree model
cor_image <- ggplot(data = df, aes(x=Var1, y=y_pred, fill=error)) + geom_tile() +xlab("Actual Values")+ylab("Predicted Values")+ ggtitle("Confusion Matrix") +theme(plot.title = element_text(hjust = 0.5))
ggplotly(cor_image)
error <- mean(test$Installs != y_pred) # Misclassification error
paste('Accuracy',round(1-error,4))
## [1] "Accuracy 0.4588"

Random Forest

mtry = sqrt(ncol(train))
tunegrid = expand.grid(.mtry=mtry)
metric = "Accuracy"
# set up a random forest model to predict installs
classifier_rf = train(Installs ~Reviews+Rating+Category+Size+Genres, data = train, method = "rf",
                      metric=metric, tuneGrid=tunegrid, trControl=train.control,  tuneLength = 5)
# print detailed information about random forest model
print(classifier_rf)
## Random Forest 
## 
## 5763 samples
##    5 predictor
##   19 classes: '1', '5', '10', '50', '100', '500', '1000', '5000', '10000', '50000', '1e+05', '5e+05', '1e+06', '5e+06', '1e+07', '5e+07', '1e+08', '5e+08', '1e+09' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 5185, 5185, 5185, 5188, 5189, 5186, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.5672261  0.5159021
## 
## Tuning parameter 'mtry' was held constant at a value of 2.828427

Confusion Matrix II

y_pred = predict(classifier_rf, newdata = test)
# Checking the prediction accuracy
df<-data.frame(table(test$Installs, y_pred)) # Confusion matrix
df <- mutate(df,Var1 = as.numeric(Var1),y_pred=as.numeric(y_pred),error=Freq)
cor_image <- ggplot(data = df, aes(x=Var1, y=y_pred, fill=error)) + geom_tile() +xlab("Actual Values")+ylab("Predicted Values")+ ggtitle("Confusion Matrix") +theme(plot.title = element_text(hjust = 0.5))
ggplotly(cor_image)
error <- mean(test$Installs != y_pred) # Misclassification error
paste('Accuracy',round(1-error,4))
## [1] "Accuracy 0.8916"
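Beyond accuracy, a random forest also ranks which predictors drive its splits; `caret::varImp()` exposes this (for the model above it would simply be `varImp(classifier_rf)`). A self-contained sketch on a tiny synthetic problem:

```r
library(caret)
set.seed(1)
# tiny synthetic problem: x1 carries the class signal, x2 is noise
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- factor(ifelse(df$x1 + rnorm(n, sd = 0.3) > 0, "high", "low"))

# fit a single random forest (no resampling) just to read importances
rf <- train(y ~ ., data = df, method = "rf",
            trControl = trainControl(method = "none"),
            tuneGrid  = data.frame(mtry = 1))

imp <- varImp(rf)$importance
imp   # x1 should dominate x2
```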

6. Summary

From the analysis above, we have built four models; let's compare them.
1. Linear Models
- The anova comparison already tells us that the model with interactions fits better than the plain one
- The interaction model also has a higher R-squared, which means it explains more of the variance
2. Tree-based Models
- Based on the confusion matrices, the Random Forest model performs better than the Decision Tree model
- The misclassification error of the random forest is far lower than that of the decision tree
- Thus we conclude that the random forest is the better model


Potential Next Steps:
1. Evaluate the tree models using more metrics
2. Analyze the categorical variables and see whether there are correlations among them (maybe with NLP if possible)